Fix dataloading #15

Draft

streichgeorg wants to merge 12 commits into main from georg/fixes

Conversation

@streichgeorg

  • Add some type coercion (float? -> float64, int? -> int64).
  • Make WSsample.__contains__ check which shards are loaded, which prevents some errors when working with partially computed features.
  • Add a fix for duplicate column names by dropping them. (Not sure if this is what we want to do; I had to add it to make some of the pretraining datasets work. Maybe there is a better solution.)

Member

This unfortunately makes the membership test as expensive as loading the data with .get('column').

How do you use this downstream?

Author

I have some datasets where certain transcripts are only computed partially. I could also catch the exception in my code, but I felt it is a bit counterintuitive if __contains__ succeeds but .get() fails.
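A minimal sketch of that behaviour (class and attribute names are hypothetical stand-ins for WSsample, not the actual implementation): __contains__ consults the set of loaded shards so that `in` and `.get()` agree.

```python
class Sample:
    """Hypothetical stand-in for WSsample."""

    def __init__(self, data: dict, loaded_columns: set):
        self._data = data              # column name -> value
        self._loaded = loaded_columns  # columns whose shards are loaded

    def __contains__(self, column: str) -> bool:
        # Only report columns that .get() could actually return.
        return column in self._data and column in self._loaded

    def get(self, column: str):
        if column not in self:
            raise KeyError(f"column {column!r} is not loaded")
        return self._data[column]
```

With this, `"col" in sample` never claims a column that a subsequent `sample.get("col")` would fail on.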

Author (@streichgeorg, Jan 26, 2026)

Feels like in most cases you want to do something like

if "col" in sample:
    # Do something with sample["col"]
else:
    # Do something else

Author

We could do the more thorough check only when include_partial_shards is set on the dataset?


self._filter_dfs[filter_name] = filter_df

rows_satisfying_filter = filter_df.sum().item()
Member

@rashishhume do you maybe know what's going on here?

Author

I deleted this since it caused a lot of log output at the start of my training jobs. Maybe it would make more sense to log this higher up.

if shard_subsample != 1:
    shard_list = rng.sample(shard_list, int(len(shard_list) * shard_subsample))

# TODO: Not sure if we want to drop the columns. I think previously we
Member

We could apply the renaming in the SQL path as well, which would make it consistent with the non-SQL API.

I think most of the issues with duplicate fields were fixed recently when we added the select in this line:

                    df = scan_ipc(shard_path, glob=False).select(fields)

Do you remember which duplicate columns are causing you headaches?

Author

It was language_whisper.txt, I think, since it is contained in all the shards with Whisper transcripts.

Member

Oh, interesting. We need to fix this then.

)

return exprs, pl.concat(row_merge).select(exprs)
def _common_dtype(col_name: str, a: pl.DataType, b: pl.DataType) -> pl.DataType:
Member

My guess would be that this is about some shards having a Null type because all the samples were None for a column?

Would concat with how="vertical_relaxed" help in this situation? (That would let Polars handle the coercion, hopefully in a sensible way.)

Author

The issue is that some metrics have shards stored as both float16 and float64.

Author (@streichgeorg, Jan 26, 2026)

I'll try vertical_relaxed. (I remember some Polars merge mode not handling the issue I was facing; I'm not 100% sure that was vertical_relaxed.)

Author (@streichgeorg)

@jpc it seems like you already have better solutions for some of these issues. These were mostly fixes I added while getting SFT to run. I'm fine with not merging this.

Georg Streich and others added 5 commits January 26, 2026 19:46